Abstract

In this paper, we give an overview of the shared task at the 4th CCF
Conference on Natural Language Processing & Chinese Computing (NLPCC
2015): Chinese word segmentation and part-of-speech (POS) tagging for
micro-blog texts. Unlike the widely used newswire datasets, the dataset of
this shared task consists of relatively informal micro-texts. The shared
task has two subtasks: (1) individual Chinese word segmentation and
(2) joint Chinese word segmentation and POS tagging. Each subtask has
three tracks to distinguish systems that use different resources. We first
introduce the dataset and task, then characterize the different approaches
of the participating systems, report the test results, and provide an
overview analysis of these results. An online system is available for open
registration and evaluation at http://nlp.fudan.edu.cn/nlpcc2015


1 Introduction 

Word segmentation and part-of-speech (POS) tagging are two fundamental tasks
for Chinese language processing. In recent years, word segmentation and POS
tagging have undergone great development. The popular method is to regard
these two tasks as sequence labeling problems, which can be handled with
supervised learning algorithms such as Maximum Entropy (ME), averaged
perceptron, and Conditional Random Fields (CRF). After years of intensive
research, Chinese word segmentation and POS tagging achieve quite high
precision. However, their performance is still not satisfactory for the
practical demands of analyzing Chinese texts, especially informal texts. The
key reason is that most annotated corpora are drawn from news texts.
Therefore, systems trained on these corpora cannot work well on
out-of-domain texts.

In this shared task, we focus on evaluating the performance of word
segmentation and POS tagging on relatively informal micro-texts.

2 Data 

Unlike the widely used newswire datasets, we use relatively informal texts
from Sina Weibo^1. The training and test data consist of micro-blogs on
various topics, such as finance, sports, and entertainment. Both the training
and test files are UTF-8 encoded.

The statistics of the dataset are shown in Table 1. The out-of-vocabulary
(OOV) rate is slightly higher than that of other benchmark datasets. For
example, the OOV rate is 5.58% in the popular division of the Chinese
Treebank (CTB 6.0) dataset [8], while the OOV rate of our dataset is 7.25%.
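The OOV rate quoted above can be illustrated with a minimal sketch. It assumes the rate is counted over test tokens (not word types) against the training vocabulary; the paper does not state which convention was used, and the function name `oov_rate` and the toy word lists are hypothetical.

```python
def oov_rate(train_words, test_words):
    """Fraction of test tokens whose word form never occurs in training.

    Assumption: token-level counting; a type-level rate would instead
    compare the two vocabularies.
    """
    vocab = set(train_words)
    if not test_words:
        return 0.0
    unseen = sum(1 for w in test_words if w not in vocab)
    return unseen / len(test_words)


# Toy example: one of the four test tokens ("微博") is unseen in training.
train = ["我", "喜欢", "北京", "天气"]
test = ["我", "喜欢", "微博", "天气"]
print(f"{oov_rate(train, test):.2%}")
```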


Table 1: Statistical information of dataset.

Dataset    Sents    Words     Chars     Word Types  Char Types  OOV Rate
Training   10,000   215,027   347,984   28,208      3,971       -
Test        5,000   106,327   171,652   18,696      3,538       7.25%
Total      15,000   322,410   520,555   35,277      4,243       -


There are 35 POS tags in total in this dataset. A detailed list of POS tags
is shown in Table 2.

^1 http://weibo.com/
Table 2: Statistical information of POS tags.

Label  Num        Label  Num
NN     84,006     PNP    4,903
PER    3,232      PNQ    492
ORG    2,578      PNI    834
LOC    9,701      CC     2,725
NR     550        CS     866
EML    3          CD     10,764
MOD    34         M      7,917
URL    11         OD     1,219
ADQ    340        LC     4,725
AD     26,155     ETC    673
JJ     9,477      SP     1,076
VA     3,339      DT     3,579
VV     51,294     IJ     20
MV     3,700      PU     52,922
DV     781        DSP    13,756
BEI    927        P      9,488
BA     600        AS     3,382
NT     5,881


2.1 Background Data 

Besides the training data, we also provide the background data from which
the training and test data are drawn. The purpose is to let participants
derive more sophisticated features in an unsupervised way.


3 Description of the Task 

In this shared task, we wish to investigate the performance of Chinese word
segmentation and POS tagging on micro-blog texts.

3.1 Subtasks 

This task focuses on two fundamental problems of Chinese language processing,
word segmentation and POS tagging, and is divided into two subtasks:

1. SEG: Chinese word segmentation

2. S&T: Joint Chinese word segmentation and POS tagging
3.2 Tracks


Each participant was allowed to submit three runs for each subtask: a closed
track run, a semi-open track run and an open track run.

1. In the closed track, participants could only use information found in the
provided training data. Information such as externally obtained word counts,
part-of-speech information, or name lists was excluded.

2. In the semi-open track, participants could use information extracted from
the provided background data in addition to the provided training data.
Information such as externally obtained word counts, part-of-speech
information, or name lists was excluded.

3. In the open track, participants could use any information that is public
and easily obtained. However, obtaining results through manual labeling or
crowdsourcing was not allowed.


4 Participants 

Sixteen teams registered for this task. In the end, there were 27 qualified
submissions from 10 teams. A summary of the qualified participating teams is
shown in Table 3.


Table 3: Summary of the participants.

            SEG                         S&T
Team        closed  open  semi-open     closed  open  semi-open
NJU         ✓       ✓     ✓
BosonNLP    ✓       ✓                   ✓       ✓
CIST        ✓             ✓             ✓             ✓
XUPT        ✓                           ✓
CCNU        ✓       ✓
ICT-NLP     ✓
BJTU        ✓       ✓     ✓             ✓       ✓     ✓
SZU                 ✓                           ✓
ZZU                       ✓
WHU                                     ✓             ✓
5 Results


5.1 Evaluation Metrics 

The evaluation measures reported are precision, recall, and an
evenly-weighted F1.
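As a rough illustration of these measures for segmentation, the sketch below counts a predicted word as correct when both of its boundaries match a gold word, and combines precision P and recall R as F1 = 2PR / (P + R). The span-matching convention and the helper names `to_spans` and `prf` are assumptions; the official scoring script may differ in details.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_sents, pred_sents):
    """Corpus-level precision, recall and evenly-weighted F1 over word spans."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = to_spans(gold), to_spans(pred)
        correct += len(g & p)  # words whose two boundaries both match
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the prediction wrongly splits "喜欢" into two words.
gold = [["我", "喜欢", "微博"]]
pred = [["我", "喜", "欢", "微博"]]
p, r, f = prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

For the joint task, the same counting could be applied to (start, end, tag) triples, so that a word is correct only if its boundaries and its POS tag both match.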

5.2 Baseline Systems 

Currently, the mainstream method of word segmentation is discriminative
character-based sequence labeling. Each character is labeled as one of
{B, M, E, S} to indicate the segmentation. {B, M, E} represent the Begin,
Middle and End of a multi-character word respectively, and S represents a
Single-character word.
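The labeling scheme can be made concrete with a small sketch that converts a segmented sentence into {B, M, E, S} labels and back. The function names are hypothetical and the example sentence is illustrative only.

```python
def words_to_tags(words):
    """Map each word to per-character B/M/E/S labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S labels."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(buf)
            buf = ""
    if buf:  # tolerate a label sequence that ends mid-word
        words.append(buf)
    return words


words = ["我", "喜欢", "新浪微博"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```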

For joint word segmentation and POS tagging, the state-of-the-art method
is also based on sequence labeling with cross-labels, which can avoid the
problem of error propagation and achieve higher performance on both subtasks
[4]. Each label is the cross-product of a segmentation label and a tagging
label, e.g. {B-NN, I-NN, E-NN, S-NN, ...}. The features are generated by
position-based templates at the character level.
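A minimal sketch of cross-labels follows. It uses the B/I/E/S prefixes from the example label set above; the function names are hypothetical, and the POS tags assigned to the toy words are illustrative rather than taken from the annotated data.

```python
def to_cross_labels(tagged_words):
    """Turn (word, POS) pairs into one cross-label per character."""
    labels = []
    for word, pos in tagged_words:
        if len(word) == 1:
            seg = ["S"]
        else:
            seg = ["B"] + ["I"] * (len(word) - 2) + ["E"]
        labels.extend(f"{s}-{pos}" for s in seg)
    return labels


def from_cross_labels(chars, labels):
    """Recover (word, POS) pairs from characters and cross-labels."""
    result, buf = [], ""
    for ch, label in zip(chars, labels):
        seg, pos = label.split("-", 1)
        buf += ch
        if seg in ("E", "S"):  # the POS of a word is read at its last character
            result.append((buf, pos))
            buf = ""
    return result


sent = [("我", "PNP"), ("喜欢", "VV"), ("微博", "NN")]
labels = to_cross_labels(sent)
print(labels)
print(from_cross_labels("我喜欢微博", labels))
```

Because one label carries both decisions, a single decoding pass yields the segmentation and the tags together, at the cost of a label set roughly four times the size of the POS tag set.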

Sequence labeling is the task of assigning labels $y = y_1, \ldots, y_n$ to
an input sequence $x = x_1, \ldots, x_n$. Given a sample $x$, we define the
feature $\Phi(x, y)$. Thus, we can label $x$ with a score function,

$$\hat{y} = \arg\max_{y} F(w, \Phi(x, y)), \tag{1}$$

where $w$ is the parameter of function $F(\cdot)$.

For sequence labeling, the feature can be denoted as
$\phi_k(y_i, y_{i-1}, x, i)$, where $i$ stands for the position in the
sequence and $k$ stands for the index of the feature template.
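Because each feature depends only on the current and previous labels, the arg max in Equation (1) can be computed exactly by dynamic programming. The sketch below is a generic Viterbi decoder under the assumption that F is a sum of local scores over positions; the paper does not describe the decoders used by the baselines, and `local_score` and the toy weights are hypothetical.

```python
def viterbi(x, labels, local_score):
    """Return the label sequence maximizing the sum of local scores.

    local_score(prev, cur, x, i) plays the role of the weighted features
    at position i; prev is None at the first position.
    """
    n = len(x)
    if n == 0:
        return []
    best = [{y: local_score(None, y, x, 0) for y in labels}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            # Best previous label for ending at y in position i.
            prev, score = max(
                ((p, best[i - 1][p] + local_score(p, y, x, i)) for p in labels),
                key=lambda t: t[1],
            )
            best[i][y] = score
            back[i][y] = prev
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[n - 1][y])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy weights, made up for illustration only.
weights = {
    ("emit", "我", "S"): 2.0,
    ("emit", "喜", "B"): 1.0,
    ("emit", "欢", "E"): 1.0,
    ("trans", "B", "E"): 1.0,
    ("trans", "S", "B"): 0.5,
}


def local_score(prev, cur, x, i):
    s = weights.get(("emit", x[i], cur), 0.0)
    if prev is not None:
        s += weights.get(("trans", prev, cur), 0.0)
    return s


print(viterbi("我喜欢", ["B", "M", "E", "S"], local_score))
```

The running time is O(n |Y|^2) for n characters and label set Y, which is one reason the larger cross-label set makes joint decoding more expensive than segmentation alone.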

Here, we use two popular open-source toolkits for sequence labeling as
the baseline systems: FNLP^2 and CRF++^3. We use the default settings of the
CRF++ toolkit with the feature templates shown in Table 4. The same feature
templates are also used for FNLP.
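To show what the character templates of Table 4 produce, the sketch below expands the unigram, bigram and trigram templates at one position of a sentence. The padding symbol, the `T{k}=` feature naming and the function names are assumptions made for illustration; they are not the internal representation of CRF++ or FNLP.

```python
PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence

# Relative character offsets, following the templates in Table 4.
TEMPLATES = [
    (-2,), (-1,), (0,), (1,), (2,),       # unigrams C-2 .. C+2
    (-1, 0), (0, 1),                      # bigrams
    (-2, -1, 0), (-1, 0, 1), (0, 1, 2),   # trigrams
]


def char_at(chars, i):
    """Character at index i, or the padding symbol outside the sentence."""
    return chars[i] if 0 <= i < len(chars) else PAD


def extract_features(chars, i):
    """Instantiate every template at position i as a string feature."""
    feats = []
    for k, offsets in enumerate(TEMPLATES):
        value = "/".join(char_at(chars, i + o) for o in offsets)
        feats.append(f"T{k}={value}")
    return feats


for feat in extract_features("我喜欢微博", 2):
    print(feat)
```

In a trained model, each such string would be paired with a candidate label (and, for transition features, the previous label) and assigned a weight.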

Table 4: Templates of CRF++ and FNLP.

unigram feature    $C_{-2}$, $C_{-1}$, $C_0$, $C_{+1}$, $C_{+2}$
bigram feature     $C_{-1} \circ C_0$, $C_0 \circ C_{+1}$
trigram feature    $C_{-2} \circ C_{-1} \circ C_0$, $C_{-1} \circ C_0 \circ C_{+1}$,
                   $C_0 \circ C_{+1} \circ C_{+2}$

^2 https://github.com/xpqiu/fnlp/
^3 http://taku910.github.io/crfpp/

5.3 Chinese word segmentation

In the word segmentation task, the best F1 scores are 95.12, 95.52 and 96.65
for the closed, semi-open and open tracks respectively. On the closed track,
the best system outperforms both baseline systems. The best system on the
semi-open track is better than that on the closed track. Unsurprisingly,
performance improves further on the open track.


Table 5: Performances of word segmentation.

Systems     Precision  Recall  F1      Track
CRF++       93.3       93.2    93.3    baseline, closed
FNLP        94.1       93.9    94.0    baseline, closed
NJU         95.14      95.09   95.12   closed
BosonNLP    95.03      95.03   95.03   closed
CIST        94.78      94.42   94.6    closed
XUPT        94.61      93.85   94.22   closed
CCNU        93.95      93.45   93.7    closed
ICT-NLP     93.96      92.91   93.43   closed
BJTU        89.49      93.55   91.48   closed
CIST        95.47      95.57   95.52   semi-open
NJU         95.3       95.31   95.3    semi-open
BJTU        90.91      94.46   92.65   semi-open
ZZU         85.36      85.25   85.31   semi-open
BosonNLP    96.56      96.75   96.65   open
NJU         96.03      96.15   96.09   open
SZU         95.52      95.64   95.58   open
CCNU        93.68      93.09   93.38   open
BJTU        91.79      94.92   93.33   open


5.4 Joint Chinese word segmentation and POS Tagging 

In the joint word segmentation and POS tagging task, the best F1 scores are
88.93, 88.69 and 91.55 for the closed, semi-open and open tracks respectively.
Table 6: Performances of joint word segmentation and POS tagging.

Systems     Precision  Recall  F1      Track
BosonNLP    88.91      88.95   88.93   closed
XUPT        88.54      87.83   88.19   closed
BJTU        88.28      87.67   87.97   closed
CIST        88.09      87.76   87.92   closed
BJTU        80.64      85.1    82.81   closed
CIST        88.64      88.73   88.69   semi-open
WHU         88.59      87.96   88.27   semi-open
BJTU        81.76      85.82   83.74   semi-open
BosonNLP    91.42      91.68   91.55   open
SZU         88.93      89.05   88.99   open
BJTU        79.85      83.51   81.64   open


6 Analysis 

7 Conclusion 

After years of intensive research, Chinese word segmentation and POS tagging
have achieved quite high precision. However, the performance of
state-of-the-art systems is still relatively low on informal texts, such as
micro-blogs and forums. The NLPCC 2015 Shared Task on Chinese Word
Segmentation and POS Tagging for Micro-blog Texts focuses on the fundamental
research in Chinese language processing.

This is the first time that micro-texts have been used to evaluate the
performance of the state-of-the-art methods.

In future work, we hope to run an online evaluation system to accept open
registration and submission. Currently, a simple system is available at
http://nlp.fudan.edu.cn/nlpcc2015. The system also gives leaderboards of the
up-to-date results for the different tasks and tracks. In addition, we wish
to extend the scale of the corpus and add more informal texts.